Lecture 5: Adventures in Feature Engineering

Feature Engineering

Where the Rubber Meets the Road

Feature Engineering…

A Mirror to Data Cleaning

In the Context of Linear Regression

Shallow Dive #1: Crime in NYC

Crime Rates by Borough in NYC

A Heat Map

Which One is Right?

But They Can’t Both Be Right

Normalized Rates

How Do We Spot These Issues?

There are many ways in which one might investigate a claim. Maintaining a healthy skepticism, especially about novel results, is usually good practice. In this case, I noticed the issue right away for a few reasons:

What About the Next Disconnection?

Crime Patterns by Time of Day

An Interesting Split

Approaches to Interactions

Average Number of Crimes in Hourly Time Period = \(\beta_0\) + \(\beta_1\) * Hour of the Day + \(\beta_2\) * Season + \(\beta_3\) * (Hour of the Day * Season) + …

These are merely a few examples. You could pursue a variety of alternative approaches.
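As one hedged sketch of the interaction approach above, the model can be fit directly in R. The data set and its columns (crime.dat, crimes, hour, season) are simulated here purely for illustration:

```r
# Simulated stand-in for hourly crime counts by season (invented data).
set.seed(1)
crime.dat <- data.frame(
  hour = rep(x = 0:23, times = 4),
  season = rep(x = c("Winter", "Spring", "Summer", "Fall"), each = 24)
)
crime.dat$crimes <- rpois(n = nrow(crime.dat), lambda = 10 + crime.dat$hour / 4)

# hour * season expands to hour + season + hour:season, allowing each
# season to have its own hourly slope.
mod.interaction <- lm(formula = crimes ~ hour * season, data = crime.dat)
length(coef(mod.interaction))
```

With 4 seasons, the expansion produces an intercept, a main hourly slope, 3 seasonal shifts, and 3 seasonal adjustments to the hourly slope.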

But What is the Best Feature?

Shallow Dive #2: Basketball

A Research Question

A Model of Making the Playoffs

A Variety of Issues

Each of these issues is worthy of some additional discussion.

Issue #1: A High Intercept

Overall, the high intercept was a cause for concern, but its value could be explained relative to some of the other terms.

Issue #2: Offensive Efficiency and Three Point Attempts

By including both offensive efficiency and most of its components (3 pointers, 2 pointers, and free throws), we created a situation in which part of the effect of additional three point attempts was absorbed into the estimate for offensive efficiency.

Issue #3: How Should We Measure Success?

Comparing Factors Over Time

Then we could more reasonably answer the research question: estimate how attempting one standard deviation more three pointers than the average team in the same season affects the likelihood of making the playoffs.
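A hedged sketch of that standardization: convert each team's three point attempts into standard deviations above the season average, so that teams are compared to their contemporaries. The table and column names (nba.dat, season, three.point.attempts) are invented:

```r
library(data.table)

# Invented team totals across two seasons.
nba.dat <- data.table(
  season = rep(x = c(2010, 2020), each = 3),
  three.point.attempts = c(1200, 1400, 1600, 2600, 2900, 3200)
)

# Z-score the attempts within each season.
nba.dat[, z.3pa := (three.point.attempts - mean(three.point.attempts)) /
          sd(three.point.attempts), by = season]
```

The standardized feature z.3pa is then comparable across eras, even though raw attempt totals have risen dramatically over time.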

Shallow Dive #3: Medicines and Outcomes

Predicting Adverse Medical Events

The Strongest Predictor

Reverse Causality

Reverse Causality and Medicine

Deep Dive: The AirBnB Data

Goal: Use machine learning models to accurately predict the prices of new listings based on their measured features and historical pricing data.

The Data

library(data.table)
dat <- fread(input = "AirBnB analysisData.csv", sep = ",", 
    fill = TRUE)
dim(dat)
[1] 29142    96
test <- fread(input = "AirBnB scoringData.csv", sep = ",", 
    fill = TRUE)
dim(test)
[1] 7286   95

Examples of Features

A Cozy Studio in Midtown, Close to Everything

Meanwhile, you might be surprised how many listings are described as cozy:

description.name <- "description"
price.name <- "price"
cozy.name <- "cozy"
pattern.cozy <- "cozy"
dat[, eval(cozy.name) := 0]
w <- grep(pattern = pattern.cozy, x = dat[, tolower(get(description.name))])
dat[w, eval(cozy.name) := 1]
dat[, mean(get(cozy.name))]
[1] 0.182623
dat[, .(`Mean Price` = mean(get(price.name))), by = cozy.name]
   cozy Mean Price
1:    1   115.0840
2:    0   136.9762

Based on my own experience renting apartments from online listings, "cozy" is a nice way of saying the place is small.

However, Challenges are Everywhere

The following examples are meant to illustrate some of the challenges the students faced in putting together their predictive models.

Challenge #1: Formulae

A number of students initially thought the competition would be as simple as fitting the following model:

mod = lm(formula = "price ~ .", data = dat)

The idea of this model is that the outcome of price should be modeled using all of the other measured variables in the data set.

However, there are some issues to think about before diving in.

Issues with the . Notation

A Lack of Accountability

More Issues with a Formula’s Notation

Creating Better Formulae

create.formula <- function(outcome.name, input.names, input.patterns = NA, 
    all.data.names = NA, return.as = "character") {
    variable.names.from.patterns <- c()
    if (!is.na(input.patterns[1]) & !is.na(all.data.names[1])) {
        pattern <- paste(input.patterns, collapse = "|")
        variable.names.from.patterns <- all.data.names[grep(pattern = pattern, 
            x = all.data.names)]
    }
    all.input.names <- unique(c(input.names, variable.names.from.patterns))
    all.input.names <- all.input.names[all.input.names != 
        outcome.name]
    if (!is.na(all.data.names[1])) {
        all.input.names <- all.input.names[all.input.names %in% 
            all.data.names]
    }
    input.names.delineated <- sprintf("`%s`", all.input.names)
    the.formula <- sprintf("`%s` ~ %s", outcome.name, paste(input.names.delineated, 
        collapse = " + "))
    if (return.as == "formula") {
        return(as.formula(the.formula))
    }
    if (return.as != "formula") {
        return(the.formula)
    }
}

Example #1: Removing a Bogus Name

num.bedrooms.name <- "bedrooms"
num.bathrooms.name <- "bathrooms"
bogus.name <- "Column not in the data"

the.formula.1 <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name, 
        bogus.name), all.data.names = names(dat))
print(the.formula.1)
[1] "`price` ~ `bedrooms` + `bathrooms`"

Example #2: Finding Patterns

pattern.review.scores <- "review_scores_"

the.formula.2 <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name), 
    input.patterns = pattern.review.scores, all.data.names = names(dat))

print(strwrap(x = the.formula.2))
[1] "`price` ~ `bedrooms` + `bathrooms` + `review_scores_rating` +"
[2] "`review_scores_accuracy` + `review_scores_cleanliness` +"     
[3] "`review_scores_checkin` + `review_scores_communication` +"    
[4] "`review_scores_location` + `review_scores_value`"             

Example #3: Column Names with Spaces

square.feet.name <- "square_feet"
square.meters.name <- "Square Meters"
dat[, `:=`(eval(square.meters.name), get(square.feet.name)/10)]

the.formula.3 <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name, 
        square.meters.name), all.data.names = names(dat))

print(the.formula.3)
[1] "`price` ~ `bedrooms` + `bathrooms` + `Square Meters`"

Without enclosing the variable names in backticks, a name containing spaces would generate an error when trying to fit a model.

Example #4: Removing Duplicates

Here we’ll accidentally include the number of bedrooms in the formula twice:

the.formula.4 <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name, 
        square.meters.name, num.bedrooms.name), all.data.names = names(dat))

print(the.formula.4)
[1] "`price` ~ `bedrooms` + `bathrooms` + `Square Meters`"

The duplicated names are automatically removed.

Challenge #2: Contrasts

Meanwhile, when the students did try to fit this model:

mod = lm(formula = "price ~ .", data = dat)

… they were surprised to find that it generated the following error message:

A Lack of a Contrast

Identifying Variables that Lack Contrasts

length.unique <- function(x) {
    return(length(unique(x)))
}
unique.values.tab <- dat[, .(Variable = names(dat), Num_Unique = as.numeric(lapply(X = .SD, 
    FUN = "length.unique")))]
# setorderv(x = unique.values.tab, cols = 'Num_Unique',
# order = 1)
library(DT)
datatable(data = unique.values.tab[Num_Unique == 1, ], rownames = FALSE)

Taking a Look at the Singular Values

dat[1, .SD, .SDcols = unique.values.tab[Num_Unique == 1, 
    Variable]]
        scrape_id experiences_offered thumbnail_url medium_url
1: 20180303203649                none          <NA>       <NA>
   xl_picture_url host_acceptance_rate has_availability requires_license
1:           <NA>                  N/A                t                f
   license
1:      NA

Information Not Available

Each of these factors could plausibly impact the price of a listing, but we have no way of estimating their effect due to a lack of measured data.

Removing Variables
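One direct approach (a hedged sketch, with a toy table standing in for the real data) is to drop any column that contains only a single unique value, since such a column cannot provide a contrast:

```r
library(data.table)

# Toy stand-in for the listings data (values invented for illustration).
toy <- data.table(
  price = c(100, 150, 120),
  license = c(NA, NA, NA),
  bedrooms = c(1, 2, 1)
)

# Identify and remove columns with a single unique value.
constant.cols <- names(toy)[sapply(X = toy, FUN = function(x) {
  length(unique(x)) == 1
})]
toy[, (constant.cols) := NULL]
names(toy)
```

Note that a column of all NA values counts as having one unique value, so it is removed along with any true constants.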

Reducing the Formula

reduce.formula <- function(dat, the.initial.formula, max.categories = NA) {
    require(data.table)
    dat <- setDT(dat)
    
    the.sides <- strsplit(x = the.initial.formula, split = "~")[[1]]
    lhs <- trimws(x = the.sides[1], which = "both")
    lhs.original <- gsub(pattern = "`", replacement = "", 
        x = lhs)
    if (!(lhs.original %in% names(dat))) {
        return("Error:  Outcome variable is not in names(dat).")
    }
    
    the.pieces.untrimmed <- strsplit(x = the.sides[2], split = "+", 
        fixed = TRUE)[[1]]
    the.pieces.untrimmed.2 <- gsub(pattern = "`", replacement = "", 
        x = the.pieces.untrimmed, fixed = TRUE)
    the.pieces.in.names <- trimws(x = the.pieces.untrimmed.2, 
        which = "both")
    
    the.pieces <- the.pieces.in.names[the.pieces.in.names %in% 
        names(dat)]
    num.variables <- length(the.pieces)
    include.pieces <- logical(num.variables)
    
    for (i in seq_len(num.variables)) {
        unique.values <- dat[, unique(get(the.pieces[i]))]
        num.unique.values <- length(unique.values)
        if (num.unique.values >= 2) {
            include.pieces[i] <- TRUE
        }
        if (!is.na(max.categories)) {
            if (dat[, is.character(get(the.pieces[i])) | 
                is.factor(get(the.pieces[i]))] == TRUE) {
                if (num.unique.values > max.categories) {
                  include.pieces[i] <- FALSE
                }
            }
        }
    }
    pieces.rhs <- sprintf("`%s`", the.pieces[include.pieces == 
        TRUE])
    rhs <- paste(pieces.rhs, collapse = " + ")
    the.formula <- sprintf("%s ~ %s", lhs, rhs)
    return(the.formula)
}

Example: Removing Variables with a Lack of Contrasts

medium.url.name <- "medium_url"
orig.formula.5 <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name, 
        medium.url.name))
formula.5 <- reduce.formula(dat = dat, the.initial.formula = orig.formula.5)
print(formula.5)
[1] "`price` ~ `bedrooms` + `bathrooms`"

Challenge #3: Too Many Contrasts

datatable(data = unique.values.tab[Num_Unique >= 20000, 
    ], rownames = FALSE)

What Generates So Many Contrasts?

Categorical Variables in Linear Models

Example: Removing Variables with Too Many Categories

orig.formula.6 <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name, 
        description.name))
formula.6 <- reduce.formula(dat = dat, the.initial.formula = orig.formula.6, 
    max.categories = 30)
print(formula.6)
[1] "`price` ~ `bedrooms` + `bathrooms`"

Tips for Handling Categorical Variables

Challenge #4: How Granular Should the Categories Be?

The location data for each listing in New York City is categorized across a variety of variables:

Each of these variables provides a different degree of precision about the information. How do we decide which factors to incorporate?
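One quick check (a sketch using the AirBnB column names, with a few invented rows in place of the full data) is to count how many categories each location variable would create:

```r
library(data.table)

# Invented rows with the AirBnB location columns.
loc <- data.table(
  neighbourhood_group_cleansed = c("Manhattan", "Manhattan", "Queens"),
  neighbourhood_cleansed = c("Midtown", "Harlem", "Astoria"),
  zipcode = c("10018", "10027", "11102")
)

# Fewer unique values mean larger samples per category but coarser precision.
loc[, lapply(X = .SD, FUN = uniqueN)]
```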

Considerations for Selecting the Granularity

Issues of Sample Size

How to proceed is highly dependent on the context of the problem. There is no absolute answer for the correct sample size or degree of granularity.

Interpretative Quality Versus Predictive Accuracy

Challenge #5: Other Levels of Granularity

Commonalities in Neighborhoods:

You Can Create Your Own Features

Additional Features

What About Free Parking?

Who is Offering Parking, and Who Isn’t?

Reverse Causality Strikes Again

If we simply went with the observed data, we would mistakenly conclude that free parking has a negative effect on the price of a listing.

Challenge #6: New Levels in the Testing Data

It would be reasonable to look for neighbourhood-level effects in pricing. Consider the following linear model:

neighborhood.name <- "neighbourhood_cleansed"
the.formula <- create.formula(outcome.name = price.name, 
    input.names = neighborhood.name)
mod <- lm(formula = the.formula, data = dat)
pred <- predict(object = mod, newdata = test)

While the model was estimated properly, the predictions led to the following error message:

The New Level Problem

The same advice applies here: consolidating the categories or dropping the variables from the model may be necessary.

Identifying Variables with New Levels

find.new.levels <- function(old.dat, new.dat, variable.name) {
    require(data.table)
    setDT(old.dat)
    setDT(new.dat)
    old.levels <- old.dat[, unique(get(variable.name))]
    new.levels <- new.dat[!(get(variable.name) %in% old.levels), 
        unique(get(variable.name))]
    return(new.levels)
}
find.new.levels(old.dat = dat, new.dat = test, variable.name = neighborhood.name)
[1] "Hollis Hills" "Westerleigh" 
zipcode.name <- "zipcode"
find.new.levels(old.dat = dat, new.dat = test, variable.name = zipcode.name)
[1]    NA 10080 11363 10550
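Once the new levels are identified, one hedged way to handle them is to recode them in the scoring data before predicting, here to NA, though mapping them to a catch-all category such as "Other" would also work. The toy tables below are invented for illustration:

```r
library(data.table)

# Toy training and scoring data (invented).
train <- data.table(neighbourhood = c("Astoria", "Harlem", "Astoria"))
score <- data.table(neighbourhood = c("Astoria", "Hollis Hills"))

# Recode scoring levels that never appeared in the training data.
known.levels <- train[, unique(neighbourhood)]
score[!(neighbourhood %in% known.levels), neighbourhood := NA_character_]
```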

Challenge #7: Finding a Better Feature

Many of the top entries in the competition incorporated latitude and longitude. Tree-based methods, in particular, were able to make splits on these coordinates that produced highly specific geographic predictions.
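As a hypothetical sketch, one engineered geographic feature is each listing's distance to a landmark (roughly Times Square, at latitude 40.758, longitude -73.986). An equirectangular approximation is used below, and the listing coordinates are invented:

```r
# Invented listing coordinates.
listings <- data.frame(
  latitude = c(40.760, 40.680),
  longitude = c(-73.990, -73.940)
)

# Approximate distance in kilometers to the reference point.
ref.lat <- 40.758
ref.lon <- -73.986
km.per.degree <- 111.32
listings$dist.km <- km.per.degree * sqrt(
  (listings$latitude - ref.lat)^2 +
  (cos(ref.lat * pi / 180) * (listings$longitude - ref.lon))^2
)
```

A single continuous distance feature can summarize some of the location effect for linear models, while tree methods can still split on the raw coordinates.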

Non-Linear Geographical Effects

Features Driving Methods

Challenge #8: Number of Bedrooms

dat[, .N, keyby = num.bedrooms.name]
    bedrooms     N
 1:        0  2756
 2:        1 21062
 3:        2  3790
 4:        3  1165
 5:        4   282
 6:        5    58
 7:        6    14
 8:        7    10
 9:        8     3
10:        9     1
11:       11     1

Unusual Listings

Leverage Points

Sensitivity Analyses

There are a variety of approaches to consider:

Forms of Sensitivity Analyses
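A hedged sketch of one common form: refit the model after excluding potential leverage points and compare the coefficients. The data below are simulated, with one artificially extreme listing:

```r
# Simulated listings with one unusual, high-leverage observation.
set.seed(7)
toy <- data.frame(bedrooms = c(rep(x = 1:3, each = 30), 11))
toy$price <- 50 + 40 * toy$bedrooms + rnorm(n = nrow(toy), mean = 0, sd = 20)
toy$price[nrow(toy)] <- 2000

# Compare coefficients with and without the leverage point.
coef.full <- coef(lm(price ~ bedrooms, data = toy))
coef.trimmed <- coef(lm(price ~ bedrooms, data = toy[toy$bedrooms <= 4, ]))
round(rbind(coef.full, coef.trimmed), 1)
```

If the estimates change substantially, as they do here, the conclusions are sensitive to a handful of unusual records, and that sensitivity is worth reporting.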

Challenge #9: Unusual Properties

property.type.name <- "property_type"
property.counts <- dat[, .N, keyby = property.type.name]
property.counts[N <= 5, ]
             property_type N
 1:             Aparthotel 4
 2:                   Boat 4
 3:                  Cabin 1
 4:              Camper/RV 3
 5: Casa particular (Cuba) 1
 6:                 Castle 1
 7:                   Cave 1
 8:                 Chalet 1
 9:            Earth house 1
10:                 Island 1
11:                 Resort 3
12:                   Tent 1
13:             Tiny house 1
14:                  Train 1
15:              Treehouse 1
16:                   Yurt 1

A Tent, a Cave, a Castle, a Chalet, an Island, an Earth House

Challenge #10: Seasonality

Seasonality in the AirBnB Data

date.name <- "last_scraped"
dat[, .N, keyby = date.name]
   last_scraped     N
1:   2018-03-04 13022
2:   2018-03-05 15650
3:   2018-03-06   470

But Seasonality is Important

So Let’s Actually Fit a Linear Regression Model

Fitting Functions

linear.regression.summary <- function(lm.mod, digits = 3, alpha = 0.05) {
  lm.coefs <- as.data.table(summary(lm.mod)$coefficients, keep.rownames = TRUE)
  z <- qnorm(p = 1 - alpha/2, mean = 0, sd = 1)
  lm.coefs[, Coef.Lower := Estimate - z * `Std. Error`]
  lm.coefs[, Coef.Upper := Estimate + z * `Std. Error`]
  return(lm.coefs)
}
round.numerics <- function(x, digits){
  if(is.numeric(x)){
    x <- round(x = x, digits = digits)
  }
  return(x)
}
fit.model <- function(dat, the.formula, digits = 3) {
  mod <- lm(formula = the.formula, data = dat)
  mod.summary <- linear.regression.summary(lm.mod = mod, digits = digits)
  mod.summary.rounded <- mod.summary[, lapply(X = .SD, FUN = "round.numerics", digits = digits)]
  return(mod.summary.rounded)
}

A Parsimonious Model

Fitting the Model

the.settings <- c("Apartment", "Loft", "Condominium")
bedroom.limit <- 4
borough.name <- "neighbourhood_group_cleansed"
rating.name <- "review_scores_rating"
x.names <- c(num.bedrooms.name, borough.name, cozy.name, 
    rating.name)
the.initial.formula <- create.formula(outcome.name = price.name, 
    input.names = x.names, all.data.names = names(dat))
the.formula <- reduce.formula(dat = dat, the.initial.formula = the.initial.formula, 
    max.categories = 30)
print(the.formula)
[1] "`price` ~ `bedrooms` + `neighbourhood_group_cleansed` + `cozy` + `review_scores_rating`"
mod <- fit.model(dat = dat[get(num.bedrooms.name) <= bedroom.limit & 
    get(property.type.name) %in% the.settings], the.formula = the.formula)

Model Results

library(DT)
datatable(data = mod, rownames = FALSE)

Cleaning the Table

old.variable.name <- "rn"
variable.name <- "Variable"
replacement.value.borough <- "Borough: "
setnames(x = mod, old = old.variable.name, new = variable.name)
mod[, `:=`(eval(variable.name), gsub(pattern = borough.name, 
    replacement = replacement.value.borough, x = get(variable.name)))]
datatable(data = mod, rownames = FALSE)

Some Insights

Predictive Power Versus Inference

Products Versus Insights

Settings Prioritizing Insights

Settings Prioritizing Accuracy

You Can Also Do Both

Meeting the Challenges, Mastering the Methods

Understanding these practical challenges can make all the difference in real projects. This is what modeling with real data – messy and imperfect all around – is like. The good news is that, with experience, you’ll be better prepared to handle the next setting. And, better yet, each practical challenge reinforces the methodological lessons you’ve learned!